Abstract The context of a data point, usually defined as the other data
points in the data set, has been found to play important roles in data
representation and classification. In this paper, we study the problem of
using the context of a data point for its classification. Our work is inspired
by the observation that only very few data points in the context of a data
point are critical for its representation and classification. We propose to
represent a data point as a sparse linear combination of its context, and to
learn the sparse context in a supervised way to increase its discriminative
ability. To this end, we propose a novel formulation for context learning that
models the learning of the context parameters and the classifier in a unified
objective, and we optimize it with an alternating strategy in an iterative
algorithm. Experiments on three benchmark data sets show its advantage over
state-of-the-art context-based data representation and classification methods.
1 Introduction

Pattern classification is a major problem in machine learning research. It is
defined here as the problem of predicting a binary class label for a given data
point. There are many examples of this problem in real-world applications. For
example, in computer vision, given an image of a face, we may want to predict
whose face it is. In natural language processing, given a text, we may want to
predict which topic it is about. Moreover, in wireless sensor networks, it is
important to detect whether a node is normal or at fault. To solve this problem,
we usually first represent the data point as a feature vector, and then learn a
classifier function to predict the class label from its feature vector. The two
most important topics of pattern classification are therefore data
representation and classifier learning.

Most data representation and classification methods are based on a single data
point: when one data point is considered for representation and classification,
all other data points are ignored. For example, in feature selection, the most
popular data representation method, given the feature vector of a data point we
simply discard the abandoned features and reorganize the remaining features
into a new feature vector to obtain the representation of the data point. In
this procedure, no data point other than the one being represented is
considered. Another example is the most popular classification method, the
support vector machine (SVM). Given a test data point, a linear function is
applied to its feature vector to predict its class label. Again, no other data
points are considered.

However, the data points other than the one under consideration may play
important roles in its representation and classification. These data points are
called the “context” of the considered data point. A data point may have a
different true nature in different contexts. Thus it is necessary to explore the
contexts of data points when they are represented and/or classified. To this
end, some methods have been proposed to use the context of a data point for its
representation and classification. In this paper, we investigate the problem of
learning an effective representation of a data point from its context, guided by
its class label, and propose a novel supervised context learning method using
sparse regularization and a linear classifier learning formulation.


1.1 Related works 

This paper explores context information for data representation and
classification, so we briefly review existing context-based data
representation and classification methods.

— The most popular context-based data classification is k nearest neighbor 
classification (KNN). Given a test data point and a training set, we first
search the training set to find the k nearest neighbors of the test data point
to represent its context, and then we determine its class label by a majority
vote of the labels of the context. All the data points of the context
contribute equally to the final classification result, and no representation
procedure is needed.

— Wright et al. [48] proposed sparse representation based classification (SRBC), 
to use the data points of one class as the context of a test data point, and
reconstruct it from its context. The reconstruction coefficients are imposed
to be sparse, and the class with the minimum reconstruction error is
assigned to the test data point. This method does not require learning an
explicit classifier to predict the class label, and thus cannot take advantage
of classifier learning techniques.

— Melacci and Belkin [25] proposed Laplacian support vector machine (LSVM), 
to use the k nearest neighbors of a training data point to represent its
context, and learn a linear classifier that respects the context. Specifically,
the classification result of a training data point is imposed to be similar to
those of its contextual data points. However, after the classifier is trained
and used to classify a test data point, the context of the test data point is
ignored.

— Gao et al. [5] proposed Laplacian sparse coding (LSC) to represent the 
context of a data point by using its k nearest neighbors, and represent the
data points with regard to the contexts. Each data point is reconstructed
as a linear combination of the codewords of a dictionary, and the combination
coefficients are imposed to be sparse. Moreover, the combination
coefficients of a data point are imposed to be similar to those of its
contextual data points. This method is simply an unsupervised data
representation method, and the class label information is ignored.


1.2 Contributions 

We propose a novel method to explore the context of a data point and use it
to represent the data point. Moreover, a linear classifier function is learned
to predict its class label from its context-based representation. We use the k
nearest neighbors of a data point as its context, and reconstruct the data
point from the data points in its context. The reconstruction coefficients are
imposed to be sparse, and we measure the sparsity by an $\ell_1$ norm
regularization, similar to sparse coding. The reconstruction result is used as
the new representation of the data point, and we apply a linear function to it
to predict the class label. To learn the reconstruction coefficient vectors of
the data points and the classifier parameter vector, we build a unified
objective function. In this function, the reconstruction error is measured by a
squared $\ell_2$ norm distance, and the classification error is measured by the
hinge loss. Moreover, the $\ell_1$ norm regularization is applied to the
reconstruction coefficient vectors to encourage their sparsity, and the squared
$\ell_2$ norm regularization is applied to the classifier parameter vector to
reduce the complexity of the classifier. By optimizing the objective function
with regard to both the reconstruction coefficient vectors and the classifier
parameter vector, the context-based representation and the classifier are
learned simultaneously. In this way, the context and the classifier can
regularize the learning of each other. To minimize the proposed objective
function, we use Lagrange multipliers and an alternating optimization
method, and develop an iterative algorithm based on the optimization results.
The contributions of this paper are twofold:

1. We propose a novel context representation formulation. A data point is
represented by its sparse reconstruction from its context. The motivation of
this contribution is that, for each data point, only a few data points in
its context are of the same class as itself. It is therefore critical to find
which data points play the most important roles in its context for the
classification of the data point itself. To find the critical contextual data
points, we propose to learn the classifier together with the sparse context.
The classifier can be used to regularize the learning of the reconstruction
coefficient vector, and thus to find the critical data points in the context.
We model this problem as a minimization problem in which the context
reconstruction error, reconstruction sparsity, classification error,
and classifier complexity are minimized simultaneously.

2. We also propose a novel iterative algorithm to solve this minimization
problem. We first reformulate it through its Lagrange function, and then use an
alternating optimization method to solve it. In each iteration, we first fix the
classifier parameter vector to update the reconstruction vectors, and then
fix the reconstruction vectors to update the classifier parameter vector.


1.3 Paper organization 

This paper is organized as follows. In Section 2 we introduce the proposed
method. In Section 3 we evaluate the proposed method experimentally. In
Section 4 the paper is concluded with future works.


2 Proposed method 

In this section, we introduce the proposed classification method, which explores
the context information. The learning problem is first formulated by modeling
an objective function, which is then optimized by an iterative algorithm.


2.1 Problem formulation 

We consider a binary classification problem. A training set of n data points
is given as $\{x_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the d-dimensional
feature vector of the i-th data point. The binary class labels of the training
points are given as $\{y_i\}_{i=1}^{n}$, where $y_i \in \{+1, -1\}$ is the class label
of the i-th point. To learn from the context of the i-th data point, we find its
k nearest neighbors and denote them as $\{x_{ij}\}_{j=1}^{k}$, where $x_{ij}$ is the j-th
nearest neighbor of the i-th point. They are further organized as a $d \times k$
matrix $X_i = [x_{i1}, \cdots, x_{ik}] \in \mathbb{R}^{d \times k}$, where the j-th column is
$x_{ij}$. The k nearest neighbors of the i-th point are used to represent its
context information. We represent $x_i$ by linearly reconstructing it from its
contextual points as

$$
x_i \approx \hat{x}_i = \sum_{j=1}^{k} x_{ij} v_{ij} = X_i v_i, \tag{1}
$$

where $\hat{x}_i$ is its reconstruction, $v_{ij}$ is the reconstruction coefficient of
the j-th nearest neighbor, and $v_i = [v_{i1}, \cdots, v_{ik}]^\top \in \mathbb{R}^k$ is the
reconstruction coefficient vector of the i-th data point. The reconstruction
coefficient vectors of all the training points are organized in the
reconstruction coefficient matrix $V = [v_1, \cdots, v_n] \in \mathbb{R}^{k \times n}$, with
$v_i$ as its i-th column. The key idea of this method is the assumption that, for
both the reconstruction and the classification of $x_i$, only a few of its
nearest neighbors play an important role, while the remaining neighbors can be
discarded, resulting in a sparse context. To encourage the sparsity of the
context, we impose an $\ell_1$ norm penalty on the contextual reconstruction
coefficient vector $v_i$. Moreover, to learn the contextual reconstruction
coefficient vectors, we also propose to minimize the reconstruction error,
measured by a squared $\ell_2$ norm penalty between $x_i$ and $X_i v_i$. The
following optimization problem is obtained,

$$
\min_{V} \; \beta \sum_{i=1}^{n} \|x_i - X_i v_i\|_2^2 + \gamma \sum_{i=1}^{n} \|v_i\|_1, \tag{2}
$$

where $\beta$ and $\gamma$ are trade-off parameters.
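
As a rough illustration of (1) and (2), the sketch below builds the context matrix of one point from its k nearest neighbors and evaluates the per-point cost for a given coefficient vector. It assumes Euclidean distance for the neighbor search and excludes the point itself from its own context; neither choice is stated in the text, and the function names `context_matrix` and `reconstruction_cost` are hypothetical.

```python
import numpy as np

def context_matrix(X, i, k):
    """Return the d x k matrix X_i whose columns are the k nearest
    neighbors of X[i] (Euclidean distance, the point itself excluded;
    both choices are assumptions of this sketch)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                      # never pick the point itself
    nn = np.argsort(dist)[:k]             # indices of the k closest points
    return X[nn].T                        # shape (d, k)

def reconstruction_cost(x, Xi, v, beta, gamma):
    """Per-point term of (2): beta * ||x - Xi v||_2^2 + gamma * ||v||_1."""
    residual = x - Xi @ v
    return beta * residual @ residual + gamma * np.abs(v).sum()

# Toy data: n = 5 points in d = 2 dimensions, one row per point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]])
Xi = context_matrix(X, i=4, k=2)          # neighbors of the point (1, 1)
v = np.array([1.0, 1.0])                  # uses both neighbors equally
cost = reconstruction_cost(X[4], Xi, v, beta=1.0, gamma=0.1)
```

In this toy case the two neighbors sum exactly to the point, so the reconstruction error vanishes and only the sparsity penalty contributes to the cost.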

To classify $x_i$, instead of applying a classifier to $x_i$ itself, we apply a
linear classifier to its contextual reconstruction $\hat{x}_i$. The classifier is
defined as

$$
f(\hat{x}_i) = w^\top \hat{x}_i = w^\top X_i v_i, \tag{3}
$$

where $w \in \mathbb{R}^d$ is the classifier parameter vector. To learn the classifier,
we consider the hinge loss function and the squared $\ell_2$ norm regularization
simultaneously. The following optimization problem is obtained with regard to
the classifier learning,

$$
\begin{aligned}
\min_{w, \xi} \;& \frac{1}{2}\|w\|_2^2 + \alpha \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \;& 1 - y_i \left(w^\top X_i v_i\right) \le \xi_i, \; \xi_i \ge 0, \; i = 1, \cdots, n,
\end{aligned} \tag{4}
$$

where $\frac{1}{2}\|w\|_2^2$ is the squared $\ell_2$ norm regularization term to reduce
the complexity of the classifier, $\xi_i$ is the slack variable for the hinge loss
of the i-th training point, $\xi = [\xi_1, \cdots, \xi_n]^\top$, and $\alpha$ is a
trade-off parameter.

The overall optimization problem is obtained by combining the problems
in (2) and (4) as

$$
\begin{aligned}
\min_{w, V, \xi} \;& \frac{1}{2}\|w\|_2^2 + \alpha \sum_{i=1}^{n} \xi_i + \beta \sum_{i=1}^{n} \|x_i - X_i v_i\|_2^2 + \gamma \sum_{i=1}^{n} \|v_i\|_1 \\
\text{s.t.} \;& 1 - y_i \left(w^\top X_i v_i\right) \le \xi_i, \; \xi_i \ge 0, \; i = 1, \cdots, n.
\end{aligned} \tag{5}
$$

From the above problem, we can see that by encouraging the sparsity of $v_i$,
we learn a sparse context for both the reconstruction and the classification
of $x_i$.
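
To make the four terms of (5) concrete, the following sketch evaluates the objective for given $w$ and $V$. It assumes each slack variable takes its smallest feasible value, $\max(0, 1 - y_i w^\top X_i v_i)$, which is the usual hinge loss; the function name `sscl_objective` and the argument layout (a list of context matrices, one column of `V` per point) are hypothetical.

```python
import numpy as np

def sscl_objective(w, V, X, y, contexts, alpha, beta, gamma):
    """Value of the objective in (5), with each slack xi_i set to its
    smallest feasible value max(0, 1 - y_i * w^T X_i v_i)."""
    total = 0.5 * w @ w                               # classifier complexity
    for i, Xi in enumerate(contexts):
        x_hat = Xi @ V[:, i]                          # contextual reconstruction
        xi = max(0.0, 1.0 - y[i] * (w @ x_hat))       # hinge loss / slack
        total += alpha * xi
        total += beta * np.sum((X[i] - x_hat) ** 2)   # reconstruction error
        total += gamma * np.abs(V[:, i]).sum()        # sparsity penalty
    return total

# Toy problem: two points, each reconstructed exactly from one neighbor.
contexts = [np.eye(2), np.eye(2)]         # X_1 and X_2, columns are neighbors
X = np.array([[1.0, 0.0], [0.0, 1.0]])    # one row per training point
V = np.eye(2)                             # column i is v_i
y = np.array([1, -1])
w = np.array([1.0, -1.0])
value = sscl_objective(w, V, X, y, contexts, alpha=1.0, beta=1.0, gamma=0.5)
```

Here both margins equal 1 and both reconstructions are exact, so only the classifier complexity and the sparsity penalty contribute to the value.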



2.2 Optimization 

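To optimize the constrained problem in (5), we write the Lagrange function
of this problem as

$$
\begin{aligned}
\mathcal{L}(w, V, \xi, \delta, \epsilon) ={}& \frac{1}{2}\|w\|_2^2 + \alpha \sum_{i=1}^{n} \xi_i + \beta \sum_{i=1}^{n} \|x_i - X_i v_i\|_2^2 + \gamma \sum_{i=1}^{n} \|v_i\|_1 \\
&+ \sum_{i=1}^{n} \delta_i \left(1 - y_i \left(w^\top X_i v_i\right) - \xi_i\right) - \sum_{i=1}^{n} \epsilon_i \xi_i,
\end{aligned} \tag{6}
$$

where $\delta_i$ is the Lagrange multiplier for the constraint
$1 - y_i (w^\top X_i v_i) \le \xi_i$, and $\epsilon_i$ is the Lagrange multiplier for the
constraint $\xi_i \ge 0$. According to the duality theory of optimization, the
following dual optimization problem is obtained,

$$
\begin{aligned}
\max_{\delta, \epsilon} \min_{w, V, \xi} \;& \mathcal{L}(w, V, \xi, \delta, \epsilon) \\
\text{s.t.} \;& \delta \ge 0, \; \epsilon \ge 0,
\end{aligned} \tag{7}
$$

where $\delta = [\delta_1, \cdots, \delta_n]^\top$ and $\epsilon = [\epsilon_1, \cdots, \epsilon_n]^\top$.
By setting the partial derivative of $\mathcal{L}$ with regard to $w$ to zero, we have

$$
\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \delta_i y_i X_i v_i. \tag{8}
$$

By setting the partial derivative of $\mathcal{L}$ with regard to $\xi_i$ to zero, we have

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;&\Rightarrow\; \alpha - \delta_i - \epsilon_i = 0 \;\Rightarrow\; \alpha - \delta_i = \epsilon_i, \\
\epsilon_i \ge 0 \;&\Rightarrow\; \alpha \ge \delta_i.
\end{aligned} \tag{9}
$$

Substituting (8) and (9) into (7), we eliminate $w$ and $\epsilon$ (the terms in
$\xi$ cancel) and obtain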



Title Suppressed Due to Excessive Length 




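$$
\begin{aligned}
\max_{\delta} \min_{V} \;& \left\{ -\frac{1}{2} \sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j v_i^\top X_i^\top X_j v_j + \beta \sum_{i=1}^{n} \|x_i - X_i v_i\|_2^2 + \gamma \sum_{i=1}^{n} \|v_i\|_1 + \sum_{i=1}^{n} \delta_i \right\} \\
\text{s.t.} \;& \boldsymbol{\alpha} \ge \delta \ge 0,
\end{aligned} \tag{10}
$$

where $\boldsymbol{\alpha} = [\alpha, \cdots, \alpha]^\top$ is an n-dimensional vector whose
elements all equal $\alpha$. It is difficult to solve this dual problem with a
closed-form solution. We instead solve it with an alternating optimization
strategy. In each iteration of an iterative algorithm, we first fix $\delta$ to
solve for $V$, and then fix $V$ to solve for $\delta$.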
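
The step from the Lagrange function (6) to the dual problem (10) can be checked term by term. The short derivation below uses only (8) and (9); it is offered as a verification sketch and is not part of the original derivation.

```latex
\begin{align*}
\alpha\sum_{i} \xi_i - \sum_{i} \delta_i \xi_i - \sum_{i} \epsilon_i \xi_i
  &= \sum_{i} (\alpha - \delta_i - \epsilon_i)\,\xi_i = 0
  && \text{by (9)},\\
\frac{1}{2}\|w\|_2^2 - \sum_{i} \delta_i y_i\, w^\top X_i v_i
  &= \frac{1}{2}\|w\|_2^2 - w^\top w = -\frac{1}{2}\|w\|_2^2
  && \text{by (8)},\\
-\frac{1}{2}\|w\|_2^2
  &= -\frac{1}{2}\sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j\,
     v_i^\top X_i^\top X_j v_j
  && \text{by (8)}.
\end{align*}
```

The remaining terms of (6), namely the reconstruction error, the $\ell_1$ penalty, and $\sum_i \delta_i$, do not involve $w$, $\xi$, or $\epsilon$ and carry over unchanged.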


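2.2.1 Solving $V$ while fixing $\delta$

When $\delta$ is fixed and only $V$ is considered, the problem in (10) is reduced to

$$
\min_{V} \left\{ -\frac{1}{2} \sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j v_i^\top X_i^\top X_j v_j + \beta \sum_{i=1}^{n} \|x_i - X_i v_i\|_2^2 + \gamma \sum_{i=1}^{n} \|v_i\|_1 \right\}. \tag{11}
$$

Instead of solving for $V$ all at once, we solve for $v_i|_{i=1}^{n}$ one by one.
When the contextual reconstruction vector $v_i$ of the i-th point is considered,
we fix those of all other points, $v_j|_{j \neq i}$, and (11) is further reduced to

$$
\min_{v_i} \left\{ -\frac{1}{2} \sum_{p,q=1}^{n} \delta_p \delta_q y_p y_q v_p^\top X_p^\top X_q v_q + \beta \|x_i - X_i v_i\|_2^2 + \gamma \|v_i\|_1 \right\}, \tag{12}
$$

where only the terms of the double sum with $p = i$ or $q = i$ depend on $v_i$.
This problem can be solved efficiently by the modified feature-sign search
algorithm proposed by Gao et al. [8].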

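2.2.2 Solving $\delta$ while fixing $V$

When $V$ is fixed and only $\delta$ is considered, the problem in (10) is reduced to

$$
\begin{aligned}
\max_{\delta} \;& -\frac{1}{2} \sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j v_i^\top X_i^\top X_j v_j + \sum_{i=1}^{n} \delta_i \\
\text{s.t.} \;& \boldsymbol{\alpha} \ge \delta \ge 0.
\end{aligned} \tag{13}
$$

This problem is a typical constrained quadratic programming (QP) problem,
and it can be solved efficiently by the active set algorithm.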
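
The box-constrained QP for the multipliers is simple enough to prototype directly. The sketch below is a minimal illustration that replaces the active set algorithm with projected gradient ascent, which is easier to write down but is not the solver named above. The function name `solve_delta` and the matrix `Z`, whose i-th row holds the reconstruction $X_i v_i$, are hypothetical.

```python
import numpy as np

def solve_delta(Z, y, alpha, n_iter=500):
    """Maximize -0.5 * d^T Q d + sum(d) subject to 0 <= d <= alpha,
    where Q[i, j] = y_i y_j z_i^T z_j and z_i = X_i v_i is row i of Z.
    Uses projected gradient ascent (a stand-in for an active set solver)."""
    G = Z * y[:, None]                    # row i is y_i * z_i
    Q = G @ G.T                           # positive semidefinite by construction
    step = 1.0 / max(np.linalg.eigvalsh(Q)[-1], 1e-12)   # 1 / largest eigenvalue
    delta = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ delta            # gradient of the concave objective
        delta = np.clip(delta + step * grad, 0.0, alpha)  # project onto the box
    return delta

# Toy example: two reconstructed points on opposite sides of the origin.
Z = np.array([[1.0, 0.0], [-1.0, 0.0]])   # rows are z_i = X_i v_i
y = np.array([1.0, -1.0])
delta = solve_delta(Z, y, alpha=10.0)
w = (delta * y) @ Z                       # classifier recovered through (8)
```

In the toy example the recovered classifier places both points exactly on the margin, as expected for a separable two-point problem.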
2.3 Iterative algorithm

The iterative algorithm to learn both the classifier parameter $w$ and the
contextual reconstruction coefficient vectors in $V$ is given in Algorithm 1. As
we can see from the algorithm, the iterations are repeated $T$ times, and then
the updated $V$ and $\delta$ are output. Please note that the variables of this
algorithm are initialized randomly.

Algorithm 1: Iterative learning algorithm.

Input: Training point set $\{x_i\}_{i=1}^{n}$ and label set $\{y_i\}_{i=1}^{n}$;

Input: Nearest neighbor size parameter $k$;

Input: Trade-off parameters $\alpha$, $\beta$ and $\gamma$;

Input: Maximum iteration number $T$.

Initialization: Find the nearest neighbors $\{x_{ij}\}_{j=1}^{k}$ for each data point
$x_i$, $i = 1, \cdots, n$;

Initialization: Initialize $\delta^{0}$ randomly;

For $t = 1, \cdots, T$

1. Fix $\delta^{t-1}$ and update the contextual reconstruction coefficient
vectors $v_i^{t}|_{i=1}^{n}$ one by one by solving the problem in (12);

2. Fix $v_i^{t}|_{i=1}^{n}$ and update the classifier parameter vector $w^{t}$ by
solving for $\delta^{t}$ as in (13).

Endfor

Output: Classifier parameter vector $w = \sum_{i=1}^{n} \delta_i^{T} y_i X_i v_i^{T}$,
where the superscript $T$ denotes the values after the last iteration.
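
The control flow of Algorithm 1 can be sketched independently of the two sub-solvers. In the skeleton below, `update_V` and `update_delta` are hypothetical callables standing in for the solvers of (12) and (13), and drawing the initial multipliers uniformly from $[0, \alpha]$ is an assumption of this sketch (the algorithm only states that the initialization is random).

```python
import numpy as np

def alternate(update_V, update_delta, n, alpha, T, seed=0):
    """Control flow of Algorithm 1. `update_V(delta)` should solve (12)
    for every point and `update_delta(V)` should solve (13); both are
    hypothetical callables supplied by the caller."""
    rng = np.random.default_rng(seed)
    delta = rng.uniform(0.0, alpha, size=n)   # random start inside the box of (10)
    V = None
    for _ in range(T):
        V = update_V(delta)                   # step 1: delta fixed, V updated
        delta = update_delta(V)               # step 2: V fixed, delta updated
    return V, delta

def classifier_from_dual(contexts, V, delta, y):
    """Recover w through (8): w = sum_i delta_i y_i X_i v_i."""
    return sum(d * yi * (Xi @ V[:, i])
               for i, (Xi, d, yi) in enumerate(zip(contexts, delta, y)))

# Dummy solvers that return fixed values, only to exercise the loop.
V, delta = alternate(lambda d: np.eye(2), lambda V: np.array([0.5, 0.5]),
                     n=2, alpha=1.0, T=3)
w = classifier_from_dual([np.eye(2), np.eye(2)], V, delta, [1, -1])
```

Passing the solvers in as arguments keeps the alternation separate from the choice of optimizer, so either step can be swapped without touching the loop.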


2.4 Classifying a test point 

When a new test point $x$ arrives, to represent its context we also find its k
nearest neighbors in the training set and put them in a $d \times k$ matrix $X$.
Given the classifier parameter vector $w$ and a candidate class label
$y \in \{+1, -1\}$, we seek its class-conditional context reconstruction
coefficient vector by solving the following minimization problem,

$$
v_y = \arg\min_{v} \left\{ -y\, w^\top (X v) + \beta \|x - X v\|_2^2 + \gamma \|v\|_1 \right\}. \tag{14}
$$

This problem can also be solved by the modified feature-sign search algorithm
proposed by Gao et al. [8]. The final class label $y^*$ of the test data point is
obtained as the candidate label minimizing the following objective,

$$
y^* = \arg\min_{y \in \{+1, -1\}} \left\{ -y\, w^\top (X v_y) \right\}. \tag{15}
$$
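
The test-time procedure can be prototyped in a few lines. The sketch below solves (14) with ISTA (proximal gradient with soft-thresholding), which is swapped in here for the feature-sign search algorithm because the problem is convex and ISTA is short to write. It further assumes that the decision rule (15) picks the candidate label with the smallest value of $-y\,w^\top(Xv_y)$. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def class_conditional_code(x, X, w, y, beta, gamma, n_iter=500):
    """Approximately solve (14) for one candidate label y with ISTA
    (proximal gradient), used here in place of feature-sign search."""
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * beta * np.linalg.norm(X, 2) ** 2 + 1e-12
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -y * (X.T @ w) + 2.0 * beta * X.T @ (X @ v - x)
        v = soft_threshold(v - grad / L, gamma / L)
    return v

def classify(x, X, w, beta, gamma):
    """Pick the label following the reading of (15) assumed in this sketch:
    the candidate y with the smallest -y * w^T (X v_y)."""
    scores = {}
    for y in (+1, -1):
        v_y = class_conditional_code(x, X, w, y, beta, gamma)
        scores[y] = -y * (w @ (X @ v_y))
    return min(scores, key=scores.get)

# Toy example: two contextual points (columns of X) and a fixed classifier.
X = np.eye(2)
w = np.array([1.0, 0.0])
label = classify(np.array([2.0, 0.0]), X, w, beta=1.0, gamma=0.1)
```

Because the candidate label enters (14) only through the linear term, the two class-conditional codes differ in how far they are pulled along the direction of $w$, and the decision compares the resulting scores.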


3 Experiments 

In this section, we evaluate the proposed supervised sparse context learning 
(SSCL) algorithm on several benchmark data sets. 





3.1 Data sets 

In the experiments, we used three data sets, which are introduced as follows:


— MANET loss data set: The packet losses of the receiver in mobile Ad 
hoc networks (MANET) can be classified into three types: losses caused by
wireless random errors, route change losses induced by node mobility, and
losses caused by network congestion. It is very important to recognize which
class a packet loss belongs to in the research and application of mobile Ad hoc
networks. The first data set used in our experiments is a MANET loss data
set. To construct this data set, we simulate a MANET scenario by using
the network simulator NS-2. We put 30 nodes in a 400m x 800m
area, select a TFRC flow as the observation stream, and a TCP flow as
the background traffic between two randomly selected nodes. The random
error rate is configured from 1% to 10%. We collect 381 data points for the
congestion loss, 458 for the route change loss, and 516 data points for the
wireless error loss. Thus in the data set, there are 1355 data points in total.
To extract the feature vector of each data point, we calculate 12 features from
each data point as in [4], and concatenate them to form a vector.

— Twitter data set: The second data set is a Twitter data set. The target 
of this data set is to predict the gender of a Twitter user, male or female,
given one of his/her Twitter messages. To construct this data set, we
downloaded the Twitter messages of 50 male users and 50 female users over 100
days. We collected 53,971 Twitter messages in total, among which
28,012 messages were sent by male users and 25,959 messages by female
users. To extract features from each Twitter message, we extract term
features, linguistic features, and medium diversity features as gender-specific
features.

— Arrhythmia data set: The third data set is publicly available at http://arc 
hive.ics.uci.edu/ml/datasets/Arrhythmia. In this data set, there are 452 
data points, and they belong to 16 different classes. Each data point has
a feature vector of 279 features. 


3.2 Experiment setup 

To conduct the experiments, we used 10-fold cross validation. An entire
data set was split into 10 folds, and each of them was used as a test set in
turn. The remaining 9 folds were combined and used as a training set. The
learning algorithm was applied to the training set to learn the classifier
parameter. The parameters of the algorithm were tuned by a 9-fold cross
validation on the training set. The learned classifier was then applied to the
test set to predict the class labels of the testing data points. The prediction
performance was evaluated by the prediction accuracy, which is defined as

$$
\text{Prediction accuracy} = \frac{\text{Number of correctly predicted testing data points}}{\text{Total number of testing data points}}. \tag{16}
$$
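
The evaluation protocol can be summarized by the following sketch, which splits a data set into folds and computes the prediction accuracy of (16) on each test fold. The inner 9-fold parameter tuning is omitted for brevity, and `fit` and `predict` are hypothetical placeholders for any learner; the random shuffling with a fixed seed is also an assumption of this sketch.

```python
import numpy as np

def k_fold_indices(n, n_folds=10, seed=0):
    """Shuffle 0..n-1 and split it into n_folds nearly equal test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), n_folds)

def prediction_accuracy(y_true, y_pred):
    """Fraction of correctly predicted testing points, as in (16)."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def cross_validate(X, y, fit, predict, n_folds=10):
    """Return one accuracy per fold; `fit` and `predict` are placeholders
    for any learner (hypothetical callables, not the paper's code)."""
    accuracies = []
    for test_idx in k_fold_indices(len(y), n_folds):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False          # hold out the current fold
        model = fit(X[train_mask], y[train_mask])
        y_pred = predict(model, X[test_idx])
        accuracies.append(prediction_accuracy(y[test_idx], y_pred))
    return accuracies
```

The list of per-fold accuracies returned by `cross_validate` is the kind of sample that a boxplot over the 10 folds would summarize.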


3.3 Results 

In the experiments, we first compare the proposed context-based data
representation and classification algorithm, SSCL, to several context-based data
representation and/or classification methods. Then we study the sensitivity
of the proposed algorithm to its parameters experimentally. Finally, we study
the convergence of the proposed iterative algorithm.

3.3.1 Comparison to context-based representation and classification methods 

Since the proposed algorithm is a context-based classification and sparse
representation method, we compared it to three popular
context-based classifiers and one context-based sparse representation method.
The three context-based classifiers are traditional KNN, Wright et al.’s SRBC
[48], and Melacci and Belkin’s LSVM [25]. The context-based sparse
representation method is Gao et al.’s LSC [5]. The boxplots of the 10-fold cross
validation accuracies of the compared algorithms are given in Figure 1. From the
figure, we can see that the proposed method SSCL outperforms all the other
methods on all three data sets. Among the median values of the boxplots of
prediction accuracies over the three data sets, that of SSCL is always the
highest. In most cases, the 25-th percentile of SSCL is even higher than the
median values of the other algorithms. The second best method is SRBC, which
also uses a sparse context to represent the data point. However, compared to
SSCL, it does not learn any explicit classifier for the classification problem,
and thus cannot take advantage of classifier design techniques. This is the main
reason that SRBC is inferior to SSCL. KNN also uses context to classify a data
point without using an explicit classifier. However, unlike SRBC, whose context
is class-conditional, KNN uses a general context and treats all contextual data
points equally, and obtains the worst classification results. This is strong
evidence that learning a supervised sparse context is critical for the
classification problem. LSVM also uses context information to regularize the
learning of the classifier. However, once the classifier is learned, the context
is ignored in the classification procedure, and thus its performance is inferior
to that of SSCL. LSC is an unsupervised learning algorithm, and it is not
surprising that its performance is not good.

Fig. 1 Boxplots of prediction accuracy of different context-based algorithms (SSCL, KNN, SRBC, LSVM, LSC): (a) MANET loss data set, (b) Twitter data set, (c) Arrhythmia data set.

3.3.2 Sensitivity to parameters

In the proposed formulation, there are three trade-off parameters, $\alpha$,
$\beta$, and $\gamma$. Moreover, we have one more parameter, the size of the
neighborhood, $k$. It is interesting to investigate how these parameters affect
the performance of the proposed algorithm. We plot the curves of mean prediction
accuracy against different values of the parameters, and show them in Figure 2.
From Figures 2(a) and 2(b), we can see that the accuracy is stable with respect
to the parameters $\alpha$ and $\beta$. More specifically, in Figure 2(a) the
performance seems to be a little better with a moderate value of $\alpha$.
$\alpha$ is the weight of the hinge loss function, so it makes sense that the
classifier performs better with a moderate value: too large a value leads to
over-fitting, while too small a value leads to large training error. It is also
interesting to note that $\beta$ also achieves the best performance with a
moderate value, 10. $\beta$ is the weight of the reconstruction error term. A
small weight of this term makes the representation of a data point irrelevant
to the data point itself, while a large weight does not guarantee its
discriminative ability. From Figures 2(c) and 2(d), we can see that a larger
$\gamma$ or $k$ leads to better classification performance. $\gamma$ is the weight
of the sparsity term, and the fact that a larger $\gamma$ achieves a higher
prediction accuracy means that the prediction benefits from a sparse
representation. This is because, in the context of a data point, only a few
data points play important roles. Sparsity of the context forces the model to
select those important contextual data points. $k$ is the size of the context;
a larger $k$ provides more candidate contextual data points, and helps the model
to find the critical contextual data points.

Fig. 2 Parameter sensitivity curves: (a) $\alpha$, (b) $\beta$, (c) $\gamma$, (d) $k$.

3.3.3 Algorithm convergence

We are also interested in the convergence of the proposed iterative algorithm
SSCL. We plot the value of the objective function of the proposed formulation
at different iterations, and show the convergence curve in Figure 3. From this
figure, it is clear that the algorithm converges after the 50-th iteration.

Fig. 3 Convergence curve of the proposed SSCL algorithm.

3.3.4 Running time analysis

We also provide an analysis of the running time of the compared algorithms
over the MANET loss data set. The running time of the algorithms is given in
Figure 4. The unit of the running time is seconds. From the figure, we can see
that the least time-consuming algorithm is KNN; however, its classification
performance is poor. Our algorithm, SSCL, is the second least time-consuming
algorithm. It takes no more than 250 seconds, while all other algorithms except
KNN take more than that. Moreover, SSCL achieves the best classification
results. This leads to the conclusion that the proposed algorithm can achieve
the best classification performance with a reasonable running time.

Fig. 4 Running time of different algorithms (SSCL, KNN, SRBC, LSVM, LSC) over the MANET loss data set.
4 Conclusion and future works

In this paper, we study the problem of using context to represent and classify
data points. Our motivation is that although the data points in the context of a
data point play important roles in its classification, only a few of them are
critical. Thus it is necessary to learn a sparse context. To this end, we
propose to use a sparse linear combination of the data points in the context of
a data point to represent the data point itself. Moreover, to increase the
discriminative ability of the new representation, we develop a supervised
method to learn the sparse context by learning it and a classifier together in
a unified optimization framework. Experiments on three benchmark data sets show
its advantage over state-of-the-art context-based data representation and
classification methods.

Although the proposed method works well for small data sets, it cannot
scale up to large data sets. The reason is that, in each iteration, it solves
the QP problem in (13), whose number of variables equals the number of data
points. This procedure works with a small number of data points; however, when
the number is large, it is too time-consuming to solve such a QP problem with so
many variables. In the future, we will investigate relaxing this QP problem to a
linear problem, by using the expectation-maximization (EM) framework to relax
the hinge loss to a linear function. Moreover, we also plan to extend the
proposed algorithm to different applications, e.g., bioinformatics, computer
vision, and information retrieval.